Introduction to Data

Spring 2026

To start class

  1. Clone your week 2 repo to your machine.
  2. Open a file to keep notes in - either a qmd or script.
  3. Make sure the file is saved in that week's project folder.
  4. Repeat.

What the **** are data?

Data are a means to represent the world

Context Matters

Semantics

  • Rows = observation, record, case, example, instance, pattern, sample
  • Columns = variable, field, feature, attribute, input, predictor, dimension

Categorical Data

  • Nominal: Unordered category
  • Ordinal: Ordered category
  • Both can be binary or multinomial

Numeric Data

  • Continuous: Can take on any number
    • Interval: Distances between values are equal and meaningful
      • Numbers are ‘arbitrary’ and lack a true 0 point
      • IQ, temperature, etc.
    • Ratio: Defined 0 point. Cannot fall below 0.
  • Discrete: Can only take on certain numbers. There are ‘gaps’ between numbers.
    • Counts & Integers (whole numbers)

A Note about Research Design

  • Qualitative Research: Descriptive statements to seek answers
  • Quantitative Research: Measurements to seek answers from qualitative or quantitative data
    • Data Science
  • Less precise: Qualitative / categorical
  • More precise: Quantitative / continuous

Data Types & R

Common Data Types

Type       Definition                                Example
Double     Whole or floating-point number            5 or 5.73
Integer    Whole number (written with an L suffix)   5L, 2L, 3L
Character  Individual characters or strings of text  “c”, “cat”, “cat in the hat”
Factor     Categorical or discrete variables         M/F, S/M/L
Boolean    Binary categories (logical)               TRUE/FALSE
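
A quick way to see how R stores a value is typeof(). The lines below are a minimal sketch in base R; the variable names are made up for illustration.

```r
# A bare number is stored as a double, even when it looks whole.
x_double <- 5
typeof(x_double)     # "double"

# The L suffix asks R for an integer.
x_integer <- 5L
typeof(x_integer)    # "integer"

# Text in quotes is a character; TRUE / FALSE are logicals.
x_char <- "cat in the hat"
typeof(x_char)       # "character"

x_bool <- TRUE
typeof(x_bool)       # "logical"
```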

Data Type in R

Numbers

[1]  4.12  4.57  5.00 17.00

Characters

[1] "M"       "male"    "F"       "cat"     "Cat-Dog"

Factors

[1] M M F M
Levels: F M

Boolean

[1]  TRUE FALSE  TRUE FALSE

Special Data Types




NULL
[1] NA
[1] NaN
[1] Inf
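
One way these special values can show up in practice is sketched below, using base R only. The scores vector is a made-up example.

```r
# NULL: an empty object with length zero.
nothing <- NULL
length(nothing)              # 0
is.null(nothing)             # TRUE

# NA: a missing value. It still occupies a slot in a vector.
scores <- c(10, NA, 30)
length(scores)               # 3
is.na(scores)                # FALSE TRUE FALSE
mean(scores)                 # NA, because one value is unknown
mean(scores, na.rm = TRUE)   # 20

# NaN: "not a number", the result of an undefined calculation.
0 / 0                        # NaN

# Inf: infinity, for example a positive number divided by zero.
1 / 0                        # Inf
```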

Data Modes

Each variable / object has a data mode, an umbrella category that groups related data types.

Numeric:
  • Both integers and doubles
  • Includes factors

Character:
  • Characters and strings

Logical:
  • Boolean TRUE and FALSE

The mode() function returns an object's data mode.

mode(42)
[1] "numeric"
a <- "beer"
mode(a)
[1] "character"
mode(T)
[1] "logical"
mode(as.factor("M"))
[1] "numeric"
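
Why does a factor report a numeric mode? A short sketch, reusing the M/F values from the factor slide: R stores each factor value as an integer code that points to a label in levels().

```r
sex <- factor(c("M", "M", "F", "M"))

levels(sex)        # "F" "M"  (sorted alphabetically by default)
as.integer(sex)    # 2 2 1 2  (position of each value within levels)

mode(sex)          # "numeric", because of the integer codes
class(sex)         # "factor", which is how R knows to show the labels
```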

Checking and Converting




is.numeric(2)
[1] TRUE
is.numeric(a)
[1] FALSE
is.character("a")
[1] TRUE
as.character(4)
[1] "4"
as.numeric(4)
[1] 4
  • It is often useful to make discrete categories (strings) into a factor. This allows for ease in analyses and visualizations.
fac <- c("M", "F")
mode(fac)
[1] "character"
fac <- factor(fac)
fac
[1] M F
Levels: F M
is.factor(fac)
[1] TRUE
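
Two things to watch for when converting to factors, sketched with made-up values: the default level order is alphabetical, and as.numeric() on a factor returns the level codes rather than the labels.

```r
sizes <- c("M", "S", "L", "M")

# Default levels are alphabetical: L, M, S.
levels(factor(sizes))

# Supplying levels sets the order; ordered = TRUE makes the factor ordinal.
sizes_f <- factor(sizes, levels = c("S", "M", "L"), ordered = TRUE)
sizes_f < "L"                      # TRUE TRUE FALSE TRUE
table(sizes_f)                     # counts per level, in S, M, L order

# Caution: as.numeric() on a factor gives the codes, not the labels.
years <- factor(c("2020", "2024"))
as.numeric(years)                  # 1 2
as.numeric(as.character(years))    # 2020 2024
```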

Data Structures

But First! Importing Data!

  • Option 1: Environment Pane > Import
    • Point and click
  • Option 2: Code!
data <- read.csv(...) # Base R

library(readr)
read_csv(...)

library(readxl)
read_xls(...)
dat <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-04-01/pokemon_df.csv')
dplyr::glimpse(dat)

Rows: 949
Columns: 22
$ id              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ pokemon         <chr> "bulbasaur", "ivysaur", "venusaur", "charmander", "cha…
$ species_id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ height          <dbl> 0.7, 1.0, 2.0, 0.6, 1.1, 1.7, 0.5, 1.0, 1.6, 0.3, 0.7,…
$ weight          <dbl> 6.9, 13.0, 100.0, 8.5, 19.0, 90.5, 9.0, 22.5, 85.5, 2.…
$ base_experience <dbl> 64, 142, 236, 62, 142, 240, 63, 142, 239, 39, 72, 178,…
$ type_1          <chr> "grass", "grass", "grass", "fire", "fire", "fire", "wa…
$ type_2          <chr> "poison", "poison", "poison", NA, NA, "flying", NA, NA…
$ hp              <dbl> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60, 40, 45…
$ attack          <dbl> 49, 62, 82, 52, 64, 84, 48, 63, 83, 30, 20, 45, 35, 25…
$ defense         <dbl> 49, 63, 83, 43, 58, 78, 65, 80, 100, 35, 55, 50, 30, 5…
$ special_attack  <dbl> 65, 80, 100, 60, 80, 109, 50, 65, 85, 20, 25, 90, 20, …
$ special_defense <dbl> 65, 80, 100, 50, 65, 85, 64, 80, 105, 20, 25, 80, 20, …
$ speed           <dbl> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 70, 50, 3…
$ color_1         <chr> "#78C850", "#78C850", "#78C850", "#F08030", "#F08030",…
$ color_2         <chr> "#A040A0", "#A040A0", "#A040A0", NA, NA, "#A890F0", NA…
$ color_f         <chr> "#81A763", "#81A763", "#81A763", NA, NA, "#DE835E", NA…
$ egg_group_1     <chr> "monster", "monster", "monster", "monster", "monster",…
$ egg_group_2     <chr> "plant", "plant", "plant", "dragon", "dragon", "dragon…
$ url_icon        <chr> "//archives.bulbagarden.net/media/upload/7/7b/001MS6.p…
$ generation_id   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ url_image       <chr> "https://raw.githubusercontent.com/HybridShivam/Pokemo…
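
To try an import without a file or an internet connection, read.csv() also accepts the CSV contents through its text argument. This is only a sketch: csv_text and small are hypothetical names, and the three rows are copied from the output above.

```r
# A tiny CSV held in a string instead of a file.
csv_text <- "pokemon,type_1,hp
bulbasaur,grass,45
ivysaur,grass,60
charmander,fire,39"

# read.csv() parses it exactly as it would parse a file on disk.
small <- read.csv(text = csv_text)

class(small)   # "data.frame"
dim(small)     # 3 3
str(small)     # one line per column, with its type
```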

Data Structures

Vector == 1 Dimension

  • All elements have the same mode.
pokemon     type_1
bulbasaur   grass
ivysaur     grass
venusaur    grass
charmander  fire
charmeleon  fire
charizard   fire
  • The c() function combines arguments into a vector.
v <- c(1, 2, 3)
w <- c(10, 11, 12)
c(v, w)
[1]  1  2  3 10 11 12
vec <- c("this", "is", "a", "vector")
vec
[1] "this"   "is"     "a"      "vector"
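
Because every element of a vector must share one mode, R quietly converts mixed input to the most flexible mode present. A small sketch with made-up values:

```r
# A number, a string, and a logical: everything becomes character.
mixed <- c(1, "a", TRUE)
mixed          # "1" "a" "TRUE"
mode(mixed)    # "character"

# Numbers and logicals: TRUE becomes 1, FALSE becomes 0.
nums <- c(1.5, TRUE, FALSE)
nums           # 1.5 1.0 0.0
mode(nums)     # "numeric"
```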

Matrix == 2 dimensions, same mode

Vectors combined together.

pokemon     type_1  type_2
bulbasaur   grass   poison
ivysaur     grass   poison
venusaur    grass   poison
charmander  fire    NA
charmeleon  fire    NA
charizard   fire    flying
  • matrix() arranges the values of a vector into rows and columns.

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

matrix(c(v, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
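
Vectors can also be glued together directly with cbind() (as columns) or rbind() (as rows). The sketch below reuses v and w from the vector slide and also shows what byrow changes; by_col and by_row are illustrative names.

```r
v <- c(1, 2, 3)
w <- c(10, 11, 12)

by_col <- cbind(v, w)   # each vector becomes a column
by_row <- rbind(v, w)   # each vector becomes a row
dim(by_col)             # 3 2
dim(by_row)             # 2 3

# byrow controls the fill order of matrix().
down   <- matrix(1:6, nrow = 2)                 # fills down the columns
across <- matrix(1:6, nrow = 2, byrow = TRUE)   # fills across the rows
down[1, ]               # 1 3 5
across[1, ]             # 1 2 3
```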

Data Frames == 2 dimensions, different modes

  • A tibble is the tidyverse version of a data frame
  • Most common R data structure.
  • Rectangular data [rows x columns]
  • Each column has a single mode (AKA column = vector).
pokemon    type_1  hp
bulbasaur  grass   45
ivysaur    grass   60
venusaur   grass   80
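
The small table above can be built by hand with data.frame(), one vector per column. A minimal sketch; pokedex is a made-up object name.

```r
pokedex <- data.frame(
  pokemon = c("bulbasaur", "ivysaur", "venusaur"),
  type_1  = c("grass", "grass", "grass"),
  hp      = c(45, 60, 80)
)

dim(pokedex)            # 3 3
sapply(pokedex, mode)   # mode of each column: character, character, numeric
pokedex$hp              # a single column is just a vector: 45 60 80
```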

Arrays and Lists: 3+ Dimensions

  • Array: Many matrices stacked into 1 object / container.
    • 1 data mode
  • List: A container that can hold any combination of data structures
    • Multiple data modes & types
    • Complex but incredibly useful
  • Remember, there is a function for that!
array(...)
data.frame(...)
list(...)

# is.[...] / as.[...]
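
A short sketch of both structures, with made-up contents. The array stacks four 2 x 3 matrices; the list holds a string, a numeric vector, and a matrix side by side.

```r
# Array: 2 rows, 3 columns, 4 layers, all one mode.
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)           # 2 3 4
arr[1, 2, 3]       # row 1, column 2, layer 3

# List: each element can be a different structure and mode.
stuff <- list(
  name  = "bulbasaur",
  stats = c(45, 49, 49),
  grid  = matrix(1:4, nrow = 2)
)
length(stuff)      # 3
stuff$stats        # 45 49 49
stuff[["name"]]    # "bulbasaur"

# The is. / as. helpers work here too.
is.list(stuff)     # TRUE
is.array(arr)      # TRUE
```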

Note!

  • R is a computer programming language

    • Languages have dialects

      • Primary / Base: Pre-loaded
      • Others: Pull from library()
  • You have to walk before you can run… we will start with base

  • Remember R can only do what you ask it! It is hyper literal

Exploring Data

Take a Peek

  • head(x, n=6): return the first n elements
  • tail(x, n=6): return the last n elements
head(dat, n = 3)
id pokemon species_id height weight base_experience type_1 type_2 hp attack defense special_attack special_defense speed color_1 color_2 color_f egg_group_1 egg_group_2 url_icon generation_id url_image
1 bulbasaur 1 0.7 6.9 64 grass poison 45 49 49 65 65 45 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/7/7b/001MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/001.png
2 ivysaur 2 1.0 13.0 142 grass poison 60 62 63 80 80 60 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/a/a0/002MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/002.png
3 venusaur 3 2.0 100.0 236 grass poison 80 82 83 100 100 80 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/0/07/003MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/003.png
tail(dat, n = 2)
id pokemon species_id height weight base_experience type_1 type_2 hp attack defense special_attack special_defense speed color_1 color_2 color_f egg_group_1 egg_group_2 url_icon generation_id url_image
10146 kommo-o-totem 784 2.4 207.5 270 dragon fighting 75 110 125 100 105 85 #7038F8 #C03028 #8336C5 dragon NA NA NA https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/10146.png
10147 magearna-original 801 1.0 80.5 120 steel fairy 80 95 115 130 115 65 #B8B8D0 #EE99AC #C5B0C7 no-eggs NA NA NA https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/10147.png
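
head() and tail() work on vectors as well as data frames. A small sketch using the built-in letters vector; a negative n means "everything except that many".

```r
first_three <- head(letters, n = 3)
first_three             # "a" "b" "c"

last_two <- tail(letters, n = 2)
last_two                # "y" "z"

# Negative n: drop the last 7 elements and keep the rest.
all_but_seven <- head(1:10, n = -7)
all_but_seven           # 1 2 3
```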

Check the Structure

  • dim(x): return the dimensions of an object
    • Matrix, data frame, or array
    • rows, columns, (depth)
dim(dat)
[1] 949  22
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow=3, ncol=2)
dim(mat)
[1] 3 2
## WARNING: a vector has no dim attribute, so dim() returns NULL. Use length() instead.

a <- c(1, 2, 3, 4)
dim(a)
NULL
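
For one-dimensional objects, length() is the tool to reach for, and nrow() / ncol() pull out a single dimension of a matrix or data frame. A sketch with the same kind of objects as above:

```r
a <- c(1, 2, 3, 4)
dim(a)        # NULL, a vector has no dimensions
length(a)     # 4

mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
nrow(mat)     # 3
ncol(mat)     # 2
length(mat)   # 6, the total number of cells
```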

Check the Structure

  • str(x): universal check; shows the structure of any object
str(a)
 num [1:4] 1 2 3 4
str(dat)
spc_tbl_ [949 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id             : num [1:949] 1 2 3 4 5 6 7 8 9 10 ...
 $ pokemon        : chr [1:949] "bulbasaur" "ivysaur" "venusaur" "charmander" ...
 $ species_id     : num [1:949] 1 2 3 4 5 6 7 8 9 10 ...
 $ height         : num [1:949] 0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
 $ weight         : num [1:949] 6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
 $ base_experience: num [1:949] 64 142 236 62 142 240 63 142 239 39 ...
 $ type_1         : chr [1:949] "grass" "grass" "grass" "fire" ...
 $ type_2         : chr [1:949] "poison" "poison" "poison" NA ...
 $ hp             : num [1:949] 45 60 80 39 58 78 44 59 79 45 ...
 $ attack         : num [1:949] 49 62 82 52 64 84 48 63 83 30 ...
 $ defense        : num [1:949] 49 63 83 43 58 78 65 80 100 35 ...
 $ special_attack : num [1:949] 65 80 100 60 80 109 50 65 85 20 ...
 $ special_defense: num [1:949] 65 80 100 50 65 85 64 80 105 20 ...
 $ speed          : num [1:949] 45 60 80 65 80 100 43 58 78 45 ...
 $ color_1        : chr [1:949] "#78C850" "#78C850" "#78C850" "#F08030" ...
 $ color_2        : chr [1:949] "#A040A0" "#A040A0" "#A040A0" NA ...
 $ color_f        : chr [1:949] "#81A763" "#81A763" "#81A763" NA ...
 $ egg_group_1    : chr [1:949] "monster" "monster" "monster" "monster" ...
 $ egg_group_2    : chr [1:949] "plant" "plant" "plant" "dragon" ...
 $ url_icon       : chr [1:949] "//archives.bulbagarden.net/media/upload/7/7b/001MS6.png" "//archives.bulbagarden.net/media/upload/a/a0/002MS6.png" "//archives.bulbagarden.net/media/upload/0/07/003MS6.png" "//archives.bulbagarden.net/media/upload/7/7d/004MS6.png" ...
 $ generation_id  : num [1:949] 1 1 1 1 1 1 1 1 1 1 ...
 $ url_image      : chr [1:949] "https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/001.png" "https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/002.png" "https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/003.png" "https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/004.png" ...
 - attr(*, "spec")=
  .. cols(
  ..   id = col_double(),
  ..   pokemon = col_character(),
  ..   species_id = col_double(),
  ..   height = col_double(),
  ..   weight = col_double(),
  ..   base_experience = col_double(),
  ..   type_1 = col_character(),
  ..   type_2 = col_character(),
  ..   hp = col_double(),
  ..   attack = col_double(),
  ..   defense = col_double(),
  ..   special_attack = col_double(),
  ..   special_defense = col_double(),
  ..   speed = col_double(),
  ..   color_1 = col_character(),
  ..   color_2 = col_character(),
  ..   color_f = col_character(),
  ..   egg_group_1 = col_character(),
  ..   egg_group_2 = col_character(),
  ..   url_icon = col_character(),
  ..   generation_id = col_double(),
  ..   url_image = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Indexing Data

  • Index: an address to one or more values


# Vectors
b <- c(1, 3, 5, 7, 9)
b
[1] 1 3 5 7 9
b[3]
[1] 5
  • Matrices / DF:
    • 2 dimensions == 2 index values
    • ALWAYS [ROWS, COLUMNS]

id pokemon species_id height weight base_experience type_1 type_2 hp attack defense special_attack special_defense speed color_1 color_2 color_f egg_group_1 egg_group_2 url_icon generation_id url_image
1 bulbasaur 1 0.7 6.9 64 grass poison 45 49 49 65 65 45 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/7/7b/001MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/001.png
2 ivysaur 2 1.0 13.0 142 grass poison 60 62 63 80 80 60 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/a/a0/002MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/002.png
3 venusaur 3 2.0 100.0 236 grass poison 80 82 83 100 100 80 #78C850 #A040A0 #81A763 monster plant //archives.bulbagarden.net/media/upload/0/07/003MS6.png 1 https://raw.githubusercontent.com/HybridShivam/Pokemon/master/assets/images/003.png
dat[2,1]
id
2
dat[2, 1:3]
id pokemon species_id
2 ivysaur 2
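
Beyond single positions, an index can be a set of positions, a negative number (drop), a name, or a logical test. The sketch below uses a small base R data frame with made-up object names; note that a tibble such as dat returns a tibble, not a bare vector, from dat[, "hp"].

```r
b <- c(1, 3, 5, 7, 9)
b[c(1, 3)]      # 1 5          (several positions at once)
b[-1]           # 3 5 7 9      (everything except the first)
b[b > 4]        # 5 7 9        (logical test picks the values)

pokedex <- data.frame(
  pokemon = c("bulbasaur", "ivysaur", "venusaur", "charmander"),
  type_1  = c("grass", "grass", "grass", "fire"),
  hp      = c(45, 60, 80, 39)
)

pokedex[2, ]                         # whole second row (column slot left empty)
pokedex[2, "pokemon"]                # "ivysaur", row by position, column by name
pokedex$hp                           # one column as a vector
strong <- pokedex[pokedex$hp > 50, ] # rows where the test is TRUE
strong$pokemon                       # "ivysaur" "venusaur"
```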

Get a Summary

  • summary() returns the summary statistics for an object.
summary(dat[,c(2,4,5)])
   pokemon              height           weight      
 Length:949         Min.   : 0.100   Min.   :  0.10  
 Class :character   1st Qu.: 0.500   1st Qu.:  8.50  
 Mode  :character   Median : 1.000   Median : 28.80  
                    Mean   : 1.228   Mean   : 66.21  
                    3rd Qu.: 1.500   3rd Qu.: 66.60  
                    Max.   :14.500   Max.   :999.90
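
What summary() reports depends on the kind of data it is given. A brief sketch using the first six hp and type_1 values from the data; the object names are illustrative.

```r
# Numeric: min, quartiles, median, mean, max.
hp <- c(45, 60, 80, 39, 58, 78)
hp_summary <- summary(hp)
hp_summary
hp_summary[["Median"]]   # 59
hp_summary[["Mean"]]     # 60

# Factor: a count for each level.
type_1 <- factor(c("grass", "grass", "grass", "fire", "fire", "fire"))
type_counts <- summary(type_1)
type_counts              # fire 3, grass 3

# Character: only length, class, and mode, as in the pokemon column above.
summary(c("bulbasaur", "ivysaur"))
```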